Abstract We learn sensor trees from training data to minimize sensor acquisition
costs at test time. Our system adaptively selects sensors at each stage, acquiring
more only when necessary to make a confident classification. We pose the problem as
empirical risk minimization over the choice of trees and node decision rules. We
decompose the problem, which is known to be intractable, into a combinatorial part
(tree structures) and a continuous part (node decision rules) and propose to solve
them separately. Using training data, we greedily solve for the combinatorial tree
structures. For the continuous part, which is a non-convex multilinear objective
function, we derive convex surrogate loss functions that are piecewise linear. The
resulting problem can be cast as a linear program and has the advantages of
guaranteed convergence, global optimality, repeatability and computational
efficiency. We show that our proposed approach outperforms the state of the art on
a number of benchmark datasets.

Keywords Adaptive sensor selection • Resource-constrained learning • Test-time 
budgeted learning 


1 Introduction 

Many scenarios involve classification systems constrained by a measurement
acquisition budget. In this setting, a collection of sensor modalities with varying
costs is available to the decision system. Our goal is to learn adaptive decision
rules from labeled training data that, when presented with a new unseen example,
select the most informative and cost-effective strategy for that example. In
contrast to non-adaptive methods Efron et al. (2004); Xu et al. (2012), which attempt
to identify a common sparse subset of sensors that can work well for all examples,
our goal is an adaptive method that classifies typical cases using inexpensive
sensors and uses expensive sensors only for atypical cases.

We propose to learn a sensor tree using labeled training examples for making 
decisions on unseen test examples. The learned sensor tree is composed of internal 
node decision rules. Given an example these decision rules select sensors and guide 
an example along a particular path terminating at a leaf where it is classified with
a classifier. We pose the problem as a global empirical risk minimization (ERM) 
over the choice of tree structures, node decision rules and leaf classifiers. 

The general problem is highly coupled, consisting of combinatorial components
(the sensor tree structure) and continuous components (the decision rules that
generalize to unseen examples), and is difficult to optimize. To gain further insight
we abstract away the generalization aspect and observe that the resulting
combinatorial problem, a special case of ours, is known to be NP-hard and requires
greedy approximations Chakaravarthy et al. (2007); Cicalese et al. (2014).

The combinatorial issue can be circumvented in cases where expert knowledge
exists, where only a few sensors exist (as in Trapeznikov et al. (2013); Trapeznikov
and Saligrama (2013); Wang et al. (2014a)), or for small-depth trees. For these latter
two cases we construct an exhaustive tree and globally learn decision functions using
a linear program by generalizing the cascade structures presented in Trapeznikov and
Saligrama (2013); Wang et al. (2014b) to binary trees, resulting in more flexible
decision systems. Convex surrogates of products of indicators have previously been
studied for supervised learning Wang and Saligrama (2013).

For more general cases we propose a two-step approach to decouple the issue 
of sensor structure from the decision rule design. 


— We greedily solve for the combinatorial tree structure and obtain feature/sensor 
sub-collections efficiently by a greedy approximation to the NP-hard problem. 
From these subsets, we construct a binary tree using hierarchical clustering of 
feature subsets. 

— On the learned tree structure, our problem now reduces to the ERM problem 
discussed above for a fixed tree where we apply a novel surrogate, allowing us 
to jointly learn the decision functions in the tree by solving a linear program. 


In the experiments, we demonstrate the performance of our approach both when
feature subsets and tree structure are given and when feature subsets and
tree structure must be learned. We show on real-world data that our approach
outperforms previously proposed approaches to budgeted learning.


1.1 Related Work 


There is an extensive literature on adaptive methods for sensor selection for
reducing test-time costs. It arguably originated with detection cascades (see Zhang
and Zhang (2010); Chen et al. (2012) and references therein), a popular method for
reducing computation cost in object detection for cases with highly skewed class
imbalance and generic features. Computationally cheap features are used at first
to filter out negative examples and the more expensive features are used in the
later stages.

Our technical approach is closely related to Trapeznikov and Saligrama (2013)
and Wang et al. (2014b). Like us, they formulate an ERM problem and generalize
detection cascades to classifier cascades that handle balanced and/or multi-class
scenarios. Like us, Wang et al. (2014b) construct convex surrogates for their
empirical risk functions and propose efficient LP solutions. Unlike ours, their
approach is limited to cascades of known structure and cannot handle trees or
unknown sensor structures.

Conceptually, our work is closely related to Xu et al. (2013) and Kusner et al.
(2014), who introduced the cost-sensitive tree of classifiers (CSTC) for reducing
test-time costs. Like our paper, they propose a global ERM problem. They solve for the
tree structure, internal decision rules and leaf classifiers jointly using alternating
minimization techniques. Recently, Kusner et al. (2014) proposed a more efficient
version of CSTC. In contrast, we decompose our global objective and solve the
individual parts separately. The disadvantage of our decoupled approach is somewhat
offset by a globally convergent solution for the decision rules once a structure
is determined. Which approach is better is an important question that must be
addressed, but it is outside the scope of this work.

The subject of this paper is broadly related to other adaptive methods in the
literature, but unlike us these methods do not learn sensor trees; they learn policies
from training data. Generative methods pose the problem as a POMDP, learn
conditional probability models Zubek and Dietterich (2002); Sheng and Ling (2006);
Bilgic and Getoor (2007); Ji and Carin (2007); Kanani and Melville (2008); Kapoor and
Horvitz (2009); Gao and Koller (2011), and myopically select features based on the
information gain of unknown features. MDP-based methods Karayev et al. (2013);
Dulac-Arnold et al. (2011); He et al. (2012); Busa-Fekete et al. (2012) encode current
observations as state, unused features as action space, and formulate various reward
functions to account for classification error and costs. He et al. (2012) apply
imitation learning of a greedy policy with a single classification step as actions.
Dulac-Arnold et al. (2011) and Karayev et al. (2013) apply reinforcement learning
to solve this MDP. Busa-Fekete et al. (2012) propose classifier cascades within an
MDP framework. They consider a fixed ordering of features and extend sequential
boosted classifiers with an additional skip action.


2 Problem Formulation: Global Empirical Risk Minimization Objective

We consider learning an adaptive decision system with training examples
$(x_1, y_1), \ldots, (x_N, y_N)$, with $L$ sensors and acquisition costs $c_m$,
$m = 1, 2, \ldots, L$. We can pose the problem of learning a rooted sensor tree as an
ERM problem. Our system is composed of three components: a tree, $T$; decision rules
$g = \{g_j\}_{j=1}^{J}$, $g_j \in \mathcal{G}_j$, associated with the $J$ internal
nodes; and classifiers $f = \{f_k\}_{k=1}^{K}$, $f_k \in \mathcal{F}$, associated
with the leaves. Each internal node of the tree is associated with a sensor and its
children denote available sensor choices. Each leaf, $k$, is associated with a sensor
subset, $S_k$, and corresponds to the unique path from the root to the leaf. The
decision rule, $g_j$, associated with node $j$ acts upon the acquired measurements
for an example and routes it to one of its children. By uniquely associating each
leaf, $k$, with the sensor outputs $b_k$ we can write the ERM as:


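$$\mathcal{L}(\mathcal{S}, f, g) = \sum_{i=1}^{N} \sum_{k=1}^{K} \Big( \mathbb{1}_{[f_k(x_i) \neq y_i]} + \alpha \sum_{m \in S_k} c_m \Big)\, \mathbb{1}_{[g(x_i) = b_k]} \;\longrightarrow\; \min_{\mathcal{S}} \min_{f} \min_{g} \mathcal{L}(\mathcal{S}, f, g) \tag{1}$$
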


where $\alpha$ is a trade-off parameter balancing classification performance with
sensor acquisition cost and $\mathcal{S}$ is the set of paths. An instance of this
general problem has been considered in the literature and shown to be NP-hard
Chakaravarthy et al. (2007) for the special case of discrete-valued sensor
measurements, arbitrarily powerful decision rules, $g_j$, and separable leaves, i.e.,
the features acquired along the leaf path uniquely and correctly identify the class.
The authors develop greedy algorithms for approximating their solution. In our
setting we allow for continuous-valued, high-dimensional measurements, and so we
cannot discretize our space. Furthermore, we are not in a separable situation and the
Bayes error is not zero. Moreover, our general ERM problem is highly coupled and
difficult to optimize. This motivates our proposed decomposition approach described
below.

Indeed, assuming arbitrary families $\mathcal{G}_j$ in Eq. (1) leads to new insights
that motivate our decomposition approach. The general problem reduces to purely
combinatorial structure learning with arbitrary $\mathcal{G}_j$ when we also have
access to an oracle classifier capable of classifying with any subset of acquired
features. The resulting problem, while NP-hard as before, is amenable to greedy
strategies. Alternatively, given a tree structure and the oracle classifier, the
optimization objective takes a multilinear form, as evidenced by Eq. (1), which also
lends itself to optimization strategies. The only issue remaining is that of an
oracle classifier, which can be circumvented for our approach. This overall
perspective justifies our approach:

1. Learn the tree structure assuming powerful decision rules. 

2. Learn decision functions $g_1, \ldots, g_{K-1}$ for the tree structure learned in Step 1.


2.1 Learning Tree Structures Greedily 


For simplicity we assume a binary sensor tree with $K$ leaves and $K - 1$ internal
nodes. Motivated by previous methods Gao and Koller (2011); Xu et al. (2013), we
assume that the number of leaves $K$ is small relative to the feature dimension. Our
approach identifies a sub-collection of subsets of features
$\mathcal{S} = \{S_1, \ldots, S_K\}$ from training data and, from it, the tree
structure, $T$. Assuming arbitrarily powerful decision functions,
$g_1, \ldots, g_{K-1}$, effectively implies that we can route each example to its
optimal subset. Furthermore, assuming that we have access to oracle classifiers,
the $f_j$'s, we can predict the class on any subset of features. Then the optimization
loss of Eq. (1) associated with the sub-collection $\mathcal{S}$ reduces to:


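$$\mathcal{L}(\mathcal{S}) = \sum_{i=1}^{N} \min_{j \in \{1, \ldots, K\}} \Big[ \mathbb{1}_{[f_j(x_i) \neq y_i]} + \alpha \sum_{k \in S_j} c_k \Big] \tag{2}$$
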


Even in the absence of noise, as noted before, the problem of minimizing this
loss is NP-hard, which motivates greedy strategies. Additionally, we overcome the
issue of an oracle classifier by learning it as we grow the tree greedily. While many
greedy strategies exist in the literature for related objectives^1, they are not
directly applicable to our setting. Our greedy algorithm is related to Awasthi et al.
(2013), who learn sparse trees/polynomials in the separable PAC setting and provide
statistical and computational guarantees. We adapt their approach to our setting as
follows: given a sub-collection of features, we expand the current subset if there is
a sensor that can further reduce our loss. If no such sensor is found, we restart the
process and look for a new subset of sensors.


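Algorithm 1 Sensor Subset Selection
Input: Number of subsets $K$
Output: Feature subsets $s_1, \ldots, s_K$
Initialize: $s_1, \ldots, s_K = \emptyset$
for $k = 1, \ldots, K$ do
    $j = \arg\min_{j \in s_k^C} \mathcal{L}(s_1, \ldots, s_k \cup j, \ldots, s_K)$
    while $\mathcal{L}(s_1, \ldots, s_k \cup j, \ldots, s_K) < \mathcal{L}(s_1, \ldots, s_k, \ldots, s_K)$ do
        $s_k = s_k \cup j$
        $j = \arg\min_{j \in s_k^C} \mathcal{L}(s_1, \ldots, s_k \cup j, \ldots, s_K)$
    end while
end for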
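
The greedy loop of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the oracle classifier is replaced by a hypothetical lookup table `ERROR_TABLE` that returns a 0/1 error per example for each sensor subset, and the names `oracle_loss` and `greedy_subsets`, the costs, and the value of `alpha` are made up for the example. The loss follows the form of Eq. (2), where every example is charged for its cheapest subset.

```python
def oracle_loss(subsets, errors, costs, alpha):
    """Eq. (2)-style loss: each example pays for its best subset (error + alpha * cost)."""
    n_examples = len(errors(subsets[0]))
    total = 0.0
    for i in range(n_examples):
        total += min(errors(s)[i] + alpha * sum(costs[m] for m in s) for s in subsets)
    return total


def greedy_subsets(K, errors, costs, alpha):
    """Grow K subsets one at a time, adding a sensor only while it lowers the loss."""
    subsets = [frozenset() for _ in range(K)]
    for k in range(K):
        while True:
            best_j, best_loss = None, oracle_loss(subsets, errors, costs, alpha)
            for j in range(len(costs)):
                if j in subsets[k]:
                    continue
                trial = subsets[:k] + [subsets[k] | {j}] + subsets[k + 1:]
                loss = oracle_loss(trial, errors, costs, alpha)
                if loss < best_loss:
                    best_j, best_loss = j, loss
            if best_j is None:  # no sensor reduces the loss: move on to the next subset
                break
            subsets[k] = subsets[k] | {best_j}
    return subsets


# Hypothetical oracle: 0/1 errors of four examples for every subset of two sensors.
ERROR_TABLE = {
    frozenset(): [1, 1, 1, 0],
    frozenset({0}): [0, 0, 1, 0],
    frozenset({1}): [1, 0, 0, 0],
    frozenset({0, 1}): [0, 0, 0, 0],
}
selected = greedy_subsets(K=2, errors=ERROR_TABLE.__getitem__, costs=[1.0, 2.0], alpha=0.1)
```

In this toy run the first subset grows to both sensors, and the second subset keeps only the cheap sensor 0, which is enough for the examples that sensor already classifies correctly.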


Tree Structure: Given the set of sensor subsets $S_1, \ldots, S_K$, the problem of
choosing a tree structure that partitions between these sensor subsets arises. We
propose a hierarchical clustering approach, where subsets are grouped based on
the number of common elements. Given a set of feature subsets, the two subsets
with the highest number of common elements are grouped together and replaced
in the set by the intersection of their elements. This is repeated recursively until
only a single subset remains, resulting in a binary tree structure. This can be viewed
as a generalization of the cascade approach: given a set of feature subsets in which
each subset adds one sensor to the previous subset, a cascade structure
is always recovered.
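
A minimal sketch of this clustering step, assuming a simple tuple representation of tree nodes and breaking ties by taking the first best pair (both are illustrative choices, not specified in the text); the names `build_tree` and `leaves` are hypothetical:

```python
from itertools import combinations


def build_tree(subsets):
    """Merge the two most-overlapping subsets until a single root remains.

    A node is a tuple (sensor_set, left_child, right_child); leaves have
    None for both children.
    """
    nodes = [(frozenset(s), None, None) for s in subsets]
    while len(nodes) > 1:
        # Pair of nodes sharing the largest number of sensors.
        a, b = max(combinations(range(len(nodes)), 2),
                   key=lambda pair: len(nodes[pair[0]][0] & nodes[pair[1]][0]))
        # The parent replaces both children and carries their common sensors.
        parent = (nodes[a][0] & nodes[b][0], nodes[a], nodes[b])
        nodes = [n for idx, n in enumerate(nodes) if idx not in (a, b)] + [parent]
    return nodes[0]


def leaves(node):
    """List the sensor subsets at the leaves, left to right."""
    sensors, left, right = node
    if left is None:
        return [sensors]
    return leaves(left) + leaves(right)


# Nested subsets, each adding one sensor to the previous one.
root = build_tree([{1}, {1, 2}, {1, 2, 3}])
```

For the nested subsets above, the root holds sensor 1, one child is the leaf {1}, and the other child is an internal node holding {1, 2}, which is the cascade structure described in the text.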

Once a tree structure, $T$, is learned we need to populate it with decision
functions so that we can generalize to unseen examples. Note that the learned sensor
structure provides us with possible choices but does not tell us what choice to
make on an unseen example. This motivates the following section.


2.2 Empirical Risk Problem for a Fixed Tree 

^1 For instance, Cicalese et al. (2014) propose a submodular surrogate to leverage
properties of the Wolsey greedy algorithm, but their surrogates require discrete
sensor measurements.

Fig. 1 An example decision system of depth two: node $g_1(x_1)$ selects either to
acquire sensor 2 for a cost $c_2$ or sensor 3 for a cost $c_3$. Node $g_2(x_1, x_2)$
selects either to stop and classify with sensors {1, 2} or to acquire sensor 3 for
$c_3$ and then stop. Node $g_3(x_1, x_3)$ selects to classify with {1, 3} or with
{1, 2, 3}.

We represent our decision system as a binary tree. The binary tree is composed
of $K$ leaves and $K - 1$ internal nodes. At each internal node, $j = 1, \ldots, K - 1$,
is a binary decision function, $\mathrm{sign}[g_j(x)] \in \{+1, -1\}$. This function
determines which action should be taken for a given example. The binary decisions,
the $g_j(x)$'s, represent actions from the following set: stop and classify with the
current set of measurements, or choose which sensor to acquire next. Each leaf node,
$k = 1, \ldots, K$, represents a terminal decision to stop and classify based on the
available information.^2 We assume that the leaf classifiers, $f_k$, are known and
fixed.^3 Note, however, that these functions implicitly operate only on the sensors
that have been acquired along the path to the corresponding node. The objective is to
learn the internal decision functions, the $g_j(x)$'s. We define the system risk:

$$R(g, x, y) = \sum_{k=1}^{K} R_k(f_k, x, y)\, G_k(g, x) \tag{3}$$

Here, $g = \{g_1, \ldots, g_{K-1}\}$ is the set of decision functions.
$R_k(f_k, x, y) = \mathbb{1}_{[f_k(x) \neq y]} + \alpha \sum_{m \in S_k} c_m$ is the
risk of making a decision at leaf $k$. It consists of two terms: the error of the
classifier at the leaf and the cost of the sensors acquired along the path from
the root node to the leaf. $S_k$ is this set of sensors, and $\alpha$ is a parameter
that controls the trade-off between acquisition cost and classification error.
$G_k(g, x) \in \{0, 1\}$ is a binary state variable indicating whether $x$ is
classified at the $k$th leaf. We compactly encode the path from the root to every
leaf in terms of the internal decisions, the $g_j(x)$'s, by two auxiliary binary
matrices $P, N \in \{0, 1\}^{K \times (K-1)}$. If $P_{k,j} = 1$ then, on the path to
leaf $k$, the decision at node $j$ must be positive: $g_j > 0$. If $N_{k,j} = 1$ then,
on the path to leaf $k$, the decision at node $j$ must be negative: $g_j \le 0$. The
$k$th rows of $P$ and $N$ jointly encode the path from the root node to leaf $k$. The
sign pattern for each path is obtained by $P - N$. For an example refer to Fig. 1.
Using these path matrices, the state variable can be defined as
$G_k(g, x) = \prod_{j=1}^{K-1} [\mathbb{1}_{g_j(x) > 0}]^{P_{k,j}} [\mathbb{1}_{g_j(x) \le 0}]^{N_{k,j}}$.
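
As a small illustration of the path encoding, the sketch below writes out $P$ and $N$ for the four-leaf tree of Fig. 1 and evaluates the state variables $G_k$ for a given vector of decision values. The assignment of signs to branches (leaves 1 and 2 under $g_1 \le 0$, leaves 3 and 4 under $g_1 > 0$) follows the convention used later in Section 3.1; the function name `leaf_state` is an assumption made for this example.

```python
# Rows are leaves 1..4, columns are internal nodes g1, g2, g3.
P = [[0, 0, 0],   # leaf 1: no positive decisions on its path
     [0, 1, 0],   # leaf 2: g2 > 0
     [1, 0, 0],   # leaf 3: g1 > 0
     [1, 0, 1]]   # leaf 4: g1 > 0 and g3 > 0
N = [[1, 1, 0],   # leaf 1: g1 <= 0 and g2 <= 0
     [1, 0, 0],   # leaf 2: g1 <= 0
     [0, 0, 1],   # leaf 3: g3 <= 0
     [0, 0, 0]]   # leaf 4: no negative decisions on its path


def leaf_state(g_values, P, N):
    """Return [G_1, ..., G_K]: 1 for the leaf whose path matches the decision signs."""
    states = []
    for p_row, n_row in zip(P, N):
        on_path = all((not p or g > 0) and (not n or g <= 0)
                      for g, p, n in zip(g_values, p_row, n_row))
        states.append(int(on_path))
    return states


state = leaf_state([-1.0, 2.0, 0.5], P, N)  # g1 <= 0, g2 > 0: the example lands in leaf 2
```

Whatever the decision values are, exactly one entry of the returned list equals 1, because each sign pattern selects a single root-to-leaf path.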

Our goal is to learn decision functions $g_1, \ldots, g_{K-1}$ that minimize the
expected system risk: $\min_g \mathbb{E}_{\mathcal{D}}[R(g, x, y)]$. However, the
model $\mathcal{D}$ is assumed to be unknown and cannot be estimated reliably due to
the potentially high-dimensional nature of the sensor outputs. Instead, we are given
a set of $N$ training examples with full sensor measurements,
$(x_1, y_1), \ldots, (x_N, y_N)$. We approximate the expected risk by
a sample average over the data and construct the following ERM problem:

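$$\min_{g \in \mathcal{G}} \sum_{i=1}^{N} R(g, x_i, y_i) = \min_{g \in \mathcal{G}} \sum_{i=1}^{N} \sum_{k=1}^{K} R_k(f_k, x_i, y_i) \underbrace{\prod_{j=1}^{K-1} [\mathbb{1}_{g_j(x_i) > 0}]^{P_{k,j}} [\mathbb{1}_{g_j(x_i) \le 0}]^{N_{k,j}}}_{G_k(\cdot)\, =\, \text{state of } x_i \text{ in a tree}} \tag{4}$$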

Note that by the definition of risk in (3), the ERM problem can be viewed as
a minimization over a function of indicators with respect to the decisions
$g_1(x), \ldots, g_{K-1}(x)$.

^2 For notational simplicity, we denote applying a decision node and a leaf classifier
as $g_j(x)$ and $f_k(x)$ respectively.

^3 Note that the classifiers at each leaf, $f_k(x) \in \mathcal{F}$, can be learned for
the $K$ leaves once the tree structure, $T$, has been determined as in the previous
section.


3 Convex Re-formulation and Solution by Linear Programming 


A popular approach to solving ERM problems is to substitute indicators with
convex upper-bounding surrogates, $\phi(z) \ge \mathbb{1}_{[z > 0]}$, and then to
minimize the resulting surrogate risk. However, such a strategy generally leads to a
non-convex, multilinear optimization problem. Previous attempts to solve problems of
this form have focused on computationally costly alternating optimization approaches
with guarantees of convergence only to a local minimum Trapeznikov and Saligrama
(2013); Wang and Saligrama (2012). Rather than attempting to solve this non-convex
surrogate problem, we instead reformulate the indicator empirical risk in
(4) as a maximization over sums of indicators before introducing a convex surrogate.
Our approach yields a globally convex upper-bounding surrogate of the empirical
loss function. In the next section, we derive this reformulation for the binary tree
of depth 2 in Fig. 1 before generalizing to arbitrary trees.


3.1 Simple Tree Example 

Consider the decision system shown in Fig. 1. The goal is to learn the decision
functions $g_1$, $g_2$, and $g_3$ that minimize the empirical risk (4).
In reformulating the risk, it is useful to define the "savings" for an example.
The savings, $\pi_k^i$, for an example $i$ represent the difference between the
worst-case outcome, $R_{\max}$, and the risk $R_k(f_k, x_i, y_i)$ for terminating and
classifying at the $k$th leaf. The worst-case risk is acquiring all sensors and
incorrectly classifying: $R_{\max} = 1 + \alpha \sum_{m} c_m$.

$$\pi_k^i = R_{\max} - R_k(f_k, x_i, y_i) = \mathbb{1}_{[f_k(x_i) = y_i]} + \alpha \sum_{m \in S_k^C} c_m \tag{5}$$

Here, $S_k^C$ is the complement of the set of sensors acquired along the path to leaf
$k$ (the sensors not acquired on the path to leaf $k$). Note that the savings do not
depend on the decisions, the $g_j$'s, that we are interested in learning.
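
To make the savings concrete, here is a short sketch that evaluates Eq. (5) for the four leaves of Fig. 1. The sensor costs, the value of $\alpha$, the assignment of sensor sets to leaves, and the correctness of each leaf classifier on the example are all hypothetical values chosen for illustration.

```python
def savings(correct, acquired, costs, alpha):
    """Eq. (5): 1 if the leaf classifier is correct, plus alpha times the cost of unused sensors."""
    unused_cost = sum(c for sensor, c in costs.items() if sensor not in acquired)
    return int(correct) + alpha * unused_cost


COSTS = {1: 1.0, 2: 2.0, 3: 4.0}                        # hypothetical sensor costs
ALPHA = 0.1                                             # hypothetical trade-off parameter
LEAF_SENSORS = [{1, 2}, {1, 2, 3}, {1, 3}, {1, 2, 3}]   # assumed sensor sets of leaves 1..4
LEAF_CORRECT = [False, True, True, True]                # assumed outcome of f_k on one example

r_max = 1 + ALPHA * sum(COSTS.values())                 # worst case: all sensors, wrong label
pi = [savings(c, s, COSTS, ALPHA) for c, s in zip(LEAF_CORRECT, LEAF_SENSORS)]
leaf_risk = [r_max - p for p in pi]                     # R_k = R_max - pi_k
```

With these numbers $R_{\max} = 1.7$ and the savings are approximately 0.4, 1.0, 1.2 and 1.0, so leaf 3 (correct, and skipping sensor 2) has the lowest risk for this example.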

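For our example, there are only 4 leaf nodes and the state of terminating in
a leaf is encoded by a product of two indicators. For instance, to terminate
in Leaf 1, $g_1(x_i) \le 0$ and $g_2(x_i) \le 0$. The empirical risk can be formulated
by enumerating over the leaves and their associated risks:

$$R(g_1, g_2, g_3, x_i, y_i) = \underbrace{(R_{\max} - \pi_1^i)\, \mathbb{1}_{g_1(x_i) \le 0} \mathbb{1}_{g_2(x_i) \le 0}}_{\text{Leaf 1}} + \underbrace{(R_{\max} - \pi_2^i)\, \mathbb{1}_{g_1(x_i) \le 0} \mathbb{1}_{g_2(x_i) > 0}}_{\text{Leaf 2}} + \underbrace{(R_{\max} - \pi_3^i)\, \mathbb{1}_{g_1(x_i) > 0} \mathbb{1}_{g_3(x_i) \le 0}}_{\text{Leaf 3}} + \underbrace{(R_{\max} - \pi_4^i)\, \mathbb{1}_{g_1(x_i) > 0} \mathbb{1}_{g_3(x_i) > 0}}_{\text{Leaf 4}} \tag{6}$$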
Directly replacing every indicator with an upper-bounding surrogate such as a hinge
loss, $\max[0, 1 + z] \ge \mathbb{1}_{[z > 0]}$, produces a non-convex bilinear
objective due to the indicator product terms. Bilinear optimization is
computationally intractable to solve globally.

Rather than directly substituting surrogates and solving the non-convex
minimization problem, we reformulate the empirical risk with respect to the indicators
in the following theorem:

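Theorem 3.1 The empirical risk in (6) is equal to

$$R(g_1, g_2, g_3, x_i, y_i) = R_{\max} - \sum_{k=1}^{4} \pi_k^i + \max\Big[ (\pi_3^i + \pi_4^i)\, \mathbb{1}_{g_1(x_i) \le 0} + \pi_2^i\, \mathbb{1}_{g_2(x_i) \le 0},\; (\pi_3^i + \pi_4^i)\, \mathbb{1}_{g_1(x_i) \le 0} + \pi_1^i\, \mathbb{1}_{g_2(x_i) > 0},\; (\pi_1^i + \pi_2^i)\, \mathbb{1}_{g_1(x_i) > 0} + \pi_4^i\, \mathbb{1}_{g_3(x_i) \le 0},\; (\pi_1^i + \pi_2^i)\, \mathbb{1}_{g_1(x_i) > 0} + \pi_3^i\, \mathbb{1}_{g_3(x_i) > 0} \Big] \tag{7}$$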

Proof Here, we provide a brief sketch of the proof. For full details please refer to
the Appendix. We utilize the following two identities:
$\mathbb{1}_{[A]} \mathbb{1}_{[B]} = \min[\mathbb{1}_{[A]}, \mathbb{1}_{[B]}]$
and $\mathbb{1}_{[A]} = 1 - \mathbb{1}_{[\bar{A}]}$, and express the risk in (6) in
terms of maximizations:

$$R(g_1, g_2, g_3, x_i, y_i) = R_{\max} - \sum_{k=1}^{4} \pi_k^i + \pi_1^i \max(\mathbb{1}_{g_1(x_i) > 0}, \mathbb{1}_{g_2(x_i) > 0}) + \pi_2^i \max(\mathbb{1}_{g_1(x_i) > 0}, \mathbb{1}_{g_2(x_i) \le 0}) + \pi_3^i \max(\mathbb{1}_{g_1(x_i) \le 0}, \mathbb{1}_{g_3(x_i) > 0}) + \pi_4^i \max(\mathbb{1}_{g_1(x_i) \le 0}, \mathbb{1}_{g_3(x_i) \le 0}) \tag{8}$$

Recall that the signs of $g_1, g_2, g_3$ encode a unique path for $x_i$. So let us
consider the sign patterns for each path. For instance, to reach leaf 1, $g_1 \le 0$
and $g_2 \le 0$. In this case, by inspection of (8), the risk is
$(\pi_3^i + \pi_4^i)\, \mathbb{1}_{[g_1(x_i) \le 0]} + \pi_2^i\, \mathbb{1}_{[g_2(x_i) \le 0]}$ +
constants. This is exactly the first term in the maximization in (7). We can perform
the same computation for each leaf (term in the max) in a similar fashion. Due to
the interdependencies in (8), the term corresponding to a valid path encoding will
be the maximizer in (7).

Risk Interpretability: Intuitively, in the reformulated empirical risk in (7),
each term in the maximization encodes a path to one of the $K$ leaves. The largest
(active) term corresponds to the path induced by the $g_j$'s for an example $x_i$.
Additionally, the weights on the indicators in (7) represent the savings lost if the
argument of the indicator is active. For example, if the decision function
$g_1(x_i)$ is negative, leaves 3 and 4 cannot be reached by $x_i$, and therefore
$\pi_3$ and $\pi_4$, the savings associated with leaves 3 and 4, cannot be realized
and are lost.

A distinct advantage of the reformulated risk in (7) arises when replacing
indicators with convex upper-bounding surrogates of the form
$\phi(z) \ge \mathbb{1}_{z > 0}$. Introducing such surrogates in the original risk in
(6) produces a bilinear function for which a global optimum cannot be efficiently
found. In contrast, introducing convex surrogate functions in (7) produces a convex
upper bound for the empirical risk.
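
The equality stated in Theorem 3.1 can be checked numerically by enumerating all sign patterns of the three decision functions. The sketch below does this for one set of hypothetical non-negative savings and a hypothetical $R_{\max}$; the function names are assumptions, and the check is a sanity test for these values rather than a proof.

```python
from itertools import product


def risk_as_products(g, pi, r_max):
    """Product-of-indicators form: (R_max - pi_k) for the single leaf that is reached."""
    g1, g2, g3 = g
    reached = [g1 <= 0 and g2 <= 0,   # leaf 1
               g1 <= 0 and g2 > 0,    # leaf 2
               g1 > 0 and g3 <= 0,    # leaf 3
               g1 > 0 and g3 > 0]     # leaf 4
    return sum((r_max - p) * int(r) for p, r in zip(pi, reached))


def risk_as_max(g, pi, r_max):
    """Max-of-sums form: a constant plus the largest of four weighted indicator sums."""
    g1, g2, g3 = g
    p1, p2, p3, p4 = pi
    pos = lambda v: int(v > 0)
    neg = lambda v: int(v <= 0)
    terms = [(p3 + p4) * neg(g1) + p2 * neg(g2),   # path to leaf 1
             (p3 + p4) * neg(g1) + p1 * pos(g2),   # path to leaf 2
             (p1 + p2) * pos(g1) + p4 * neg(g3),   # path to leaf 3
             (p1 + p2) * pos(g1) + p3 * pos(g3)]   # path to leaf 4
    return r_max - sum(pi) + max(terms)


pi = (0.4, 1.0, 1.2, 1.0)   # hypothetical non-negative savings for one example
r_max = 1.7                 # hypothetical worst-case risk
gaps = [abs(risk_as_products(g, pi, r_max) - risk_as_max(g, pi, r_max))
        for g in product((-1.0, 1.0), repeat=3)]
```

All eight entries of `gaps` are zero up to floating-point rounding. The argument relies on the savings being non-negative, which holds by their definition.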
3.2 Extension to Arbitrary Binary Trees

In this section, we generalize the empirical risk reformulation to any binary tree
and present a convex surrogate. Consider a binary tree, $T$, composed of $K - 1$
internal nodes and $K$ leaves. As defined in (5), each leaf has a corresponding
savings $\pi_k^i$ that captures the difference between the worst-case risk and the
risk of classifying at leaf $k$.

Note that in the previous example, the risk in (7) consists of a max of $K$
terms. Each term is a weighted linear combination of indicators, and each weight
corresponds to the savings lost if the decision inside the indicator argument is
true. For an arbitrary binary tree of $K$ leaves, the risk has an analogous form.

Before stating the result, we define the weights for the linear combination in
each term of the max. For an internal node $j$, we denote by $C_j^n$ the set of leaf
nodes in the subtree corresponding to a negative decision, $g_j(x) \le 0$, and by
$C_j^p$ the set of leaf nodes in the subtree corresponding to a positive decision.
For instance, in Fig. 1, $C_1^p = \{\text{Leaf 3}, \text{Leaf 4}\}$, and in our example
(7), the weight multiplying $\mathbb{1}_{[g_1(x_i) \le 0]}$ is the sum of the savings
for leaves 3 and 4 (i.e. the savings lost if $g_1 \le 0$). Therefore, the sets
$C_j^{p/n}$ define which $\pi_k^i$'s contribute to the weight of a decision term.

For a compact representation, recall that the $k$th rows of the matrices $P$ and $N$
define the path to leaf $k$ in terms of $g_1, \ldots, g_{K-1}$, and a non-zero
$P_{k,j}$ or $N_{k,j}$ indicates whether $g_j > 0$ or $g_j \le 0$ is on the path to
leaf $k$. So for each $x_i$ and each leaf $k$, we introduce two non-negative weight
row vectors of length $K - 1$:

$$w_{n,k}^i = \Big[ N_{k,1} \sum_{l \in C_1^p} \pi_l^i,\; \ldots,\; N_{k,K-1} \sum_{l \in C_{K-1}^p} \pi_l^i \Big], \qquad w_{p,k}^i = \Big[ P_{k,1} \sum_{l \in C_1^n} \pi_l^i,\; \ldots,\; P_{k,K-1} \sum_{l \in C_{K-1}^n} \pi_l^i \Big]$$

The $j$th component of $w_{n,k}^i$ ($w_{p,k}^i$) multiplies the
$\mathbb{1}_{[g_j(x_i) \le 0]}$ ($\mathbb{1}_{[g_j(x_i) > 0]}$) term corresponding
to the $k$th leaf. For instance, in our 4-leaf example in (7),
$(w_{n,2}^i)_1 = \pi_3^i + \pi_4^i$. If $P_{k,j}$ ($N_{k,j}$) is zero, then the
decision $g_j > 0$ ($g_j \le 0$) is not on the path to leaf $k$ and the weight is
zero. Using these weight definitions, the empirical risk in (7) can be extended to
arbitrary binary trees:

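Corollary 3.2 The empirical risk of tree $T$ is:

$$R(g, x_i, y_i) = R_{\max} - \sum_{k=1}^{K} \pi_k^i + \max_{k \in \{1, \ldots, K\}} \left( w_{p,k}^i \begin{bmatrix} \mathbb{1}_{g_1(x_i) > 0} \\ \vdots \\ \mathbb{1}_{g_{K-1}(x_i) > 0} \end{bmatrix} + w_{n,k}^i \begin{bmatrix} \mathbb{1}_{g_1(x_i) \le 0} \\ \vdots \\ \mathbb{1}_{g_{K-1}(x_i) \le 0} \end{bmatrix} \right) \tag{9}$$

The proof of this corollary is included in the Appendix and follows the same steps
as Thm. 3.1.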
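
The weight construction can be sketched directly from the path matrices, since column $j$ of $P$ marks the leaves in the positive subtree of node $j$ and column $j$ of $N$ marks the leaves in its negative subtree. The example below uses a three-leaf cascade with hypothetical savings; the function names `path_weights` and `tree_risk` are assumptions for illustration.

```python
def path_weights(P, N, pi):
    """Build the weight vectors w_p[k], w_n[k] of each leaf k from P, N and the savings."""
    K, J = len(P), len(P[0])
    pos_sum = [sum(pi[l] for l in range(K) if P[l][j]) for j in range(J)]  # savings in C_j^p
    neg_sum = [sum(pi[l] for l in range(K) if N[l][j]) for j in range(J)]  # savings in C_j^n
    # Taking the positive branch at node j loses the savings of the negative subtree, and vice versa.
    w_p = [[P[k][j] * neg_sum[j] for j in range(J)] for k in range(K)]
    w_n = [[N[k][j] * pos_sum[j] for j in range(J)] for k in range(K)]
    return w_p, w_n


def tree_risk(g, P, N, pi, r_max):
    """Max-over-leaves form of the risk for an arbitrary binary tree."""
    w_p, w_n = path_weights(P, N, pi)
    terms = [sum(wp * (gj > 0) + wn * (gj <= 0) for wp, wn, gj in zip(w_p[k], w_n[k], g))
             for k in range(len(P))]
    return r_max - sum(pi) + max(terms)


# Three-leaf cascade: leaf 1 if g1 <= 0; leaf 2 if g1 > 0 and g2 <= 0; leaf 3 otherwise.
P = [[0, 0], [1, 0], [1, 1]]
N = [[1, 0], [0, 1], [0, 0]]
pi = (0.5, 1.2, 1.0)   # hypothetical savings
r_max = 1.7            # hypothetical worst-case risk
risk_leaf2 = tree_risk((1.0, -1.0), P, N, pi, r_max)  # decisions route the example to leaf 2
```

For the decision values `(1.0, -1.0)` the active term belongs to leaf 2 and the risk evaluates to $R_{\max} - \pi_2 = 0.5$, matching the product form.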
The empirical risk in (9) represents a scan over the paths to each leaf
($k = 1, \ldots, K$), and the active term in the maximization corresponds to the leaf
to which an observation is assigned by the decision functions $g_1, \ldots, g_{K-1}$.
An important observation is that each term in the max in (9) is a linear combination
of indicators instead of a product as in (4). This transformation enables us to
upper-bound each indicator function with a convex surrogate,
$\phi[g_j(x)] \ge \mathbb{1}_{[g_j(x) > 0]}$ and
$\phi[-g_j(x)] \ge \mathbb{1}_{[g_j(x) \le 0]}$. Because the weights are
non-negative, each term becomes a non-negative combination of convex functions, and
a maximum of convex functions is convex. The result is a novel convex upper bound
on the empirical risk in (4). We denote this risk as $R_\phi(g)$. The optimization
problem over a set of training examples, $\{x_i, y_i\}_{i=1}^{N}$, and a family of
decision functions $\mathcal{G}$ is:
$\min_{g \in \mathcal{G}} \sum_{i=1}^{N} R_\phi(g, x_i, y_i)$.


3.3 Linear Programming 

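There are many valid choices for the surrogate $\phi$. However, if a hinge loss is
used as an upper bound and $\mathcal{G}$ is a family of linear functions of the data,
then the optimization problem in Section 3.2 becomes a linear program (LP).

Proposition 3.3 For $\phi(z) = \max(1 + z, 0)$ and linear decision functions
$g_1, \ldots, g_{K-1}$, the minimization in Section 3.2 is equivalent to the following
linear program:

$$\min_{g_1, \ldots, g_{K-1},\; \gamma^1, \ldots, \gamma^N,\; \alpha^1, \ldots, \alpha^N,\; \beta^1, \ldots, \beta^N} \; \sum_{i=1}^{N} \gamma^i$$

$$\text{subject to:} \quad \gamma^i \ge w_{p,k}^i \begin{bmatrix} \alpha_1^i \\ \vdots \\ \alpha_{K-1}^i \end{bmatrix} + w_{n,k}^i \begin{bmatrix} \beta_1^i \\ \vdots \\ \beta_{K-1}^i \end{bmatrix}, \quad i \in [N],\; k \in [K],$$

$$1 + g_j(x_i) \le \alpha_j^i, \quad 1 - g_j(x_i) \le \beta_j^i, \quad \alpha_j^i \ge 0, \quad \beta_j^i \ge 0, \quad j \in [K-1],\; i \in [N]. \tag{10}$$
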


We introduce the variable $\gamma^i$ for each example $x_i$ to convert from a
maximization over leaves to a set of linear constraints. Similarly, the maximization
within each hinge loss is converted to a set of linear constraints. The variables
$\alpha_j^i$ upper-bound the indicator $\mathbb{1}_{g_j(x_i) > 0}$ and the variables
$\beta_j^i$ upper-bound the indicator $\mathbb{1}_{g_j(x_i) \le 0}$. Additionally, the
constant terms in the risk are removed for notational simplicity, as these do not
affect the solution to the linear program. For details please refer to the Appendix.
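
As an illustration of how such a program might be assembled, the sketch below builds the objective vector and inequality constraints of an LP in the generic form "minimize $c^\top z$ subject to $A z \le b$" for linear rules $g_j(x) = \theta_j^\top x$. The variable ordering, the dense list-of-lists representation, the function names `build_lp` and `tight_point`, and the toy data are all assumptions made for this example; no solver is invoked.

```python
def build_lp(X, W_p, W_n):
    """Assemble min c.z s.t. A z <= b with z = [theta (J*d), gamma (n), alpha (n*J), beta (n*J)].

    X[i] is the feature vector of example i; W_p[i][k][j] and W_n[i][k][j] are the
    weights of example i, leaf k, node j.
    """
    n, d = len(X), len(X[0])
    K, J = len(W_p[0]), len(W_p[0][0])
    n_var = J * d + n + 2 * n * J
    theta = lambda j, t: j * d + t
    gamma = lambda i: J * d + i
    alpha = lambda i, j: J * d + n + i * J + j
    beta = lambda i, j: J * d + n + n * J + i * J + j

    def row(entries):
        r = [0.0] * n_var
        for idx, val in entries:
            r[idx] = val
        return r

    c = row((gamma(i), 1.0) for i in range(n))          # objective: sum of gamma_i
    A, b = [], []
    for i in range(n):
        for k in range(K):                              # w_p.alpha_i + w_n.beta_i - gamma_i <= 0
            entries = [(gamma(i), -1.0)]
            entries += [(alpha(i, j), W_p[i][k][j]) for j in range(J)]
            entries += [(beta(i, j), W_n[i][k][j]) for j in range(J)]
            A.append(row(entries)); b.append(0.0)
        for j in range(J):
            g_coef = [(theta(j, t), X[i][t]) for t in range(d)]
            A.append(row(g_coef + [(alpha(i, j), -1.0)])); b.append(-1.0)   # 1 + g_j(x_i) <= alpha
            A.append(row([(idx, -v) for idx, v in g_coef] + [(beta(i, j), -1.0)])); b.append(-1.0)  # 1 - g_j(x_i) <= beta
            A.append(row([(alpha(i, j), -1.0)])); b.append(0.0)             # alpha >= 0
            A.append(row([(beta(i, j), -1.0)])); b.append(0.0)              # beta >= 0
    return c, A, b


def tight_point(Theta, X, W_p, W_n):
    """Feasible z for given rules Theta[j]: hinge values for alpha, beta; max over leaves for gamma."""
    g = [[sum(w * x for w, x in zip(th, xi)) for th in Theta] for xi in X]
    a = [[max(1.0 + v, 0.0) for v in gi] for gi in g]
    bt = [[max(1.0 - v, 0.0) for v in gi] for gi in g]
    gam = [max(sum(wp * ai + wn * bi for wp, wn, ai, bi in zip(W_p[i][k], W_n[i][k], a[i], bt[i]))
               for k in range(len(W_p[i]))) for i in range(len(X))]
    flat = lambda rows: [v for r in rows for v in r]
    return flat(Theta) + gam + flat(a) + flat(bt)


# Toy instance: two examples, a three-leaf cascade (two internal nodes), hypothetical weights.
X = [[1.0, 0.5], [1.0, -2.0]]
w_p = [[0.0, 0.0], [0.5, 0.0], [0.5, 1.2]]
w_n = [[2.2, 0.0], [0.0, 1.0], [0.0, 0.0]]
W_p, W_n = [w_p, w_p], [w_n, w_n]
Theta = [[0.2, 1.0], [-0.5, 0.3]]
c, A, b = build_lp(X, W_p, W_n)
z = tight_point(Theta, X, W_p, W_n)
objective = sum(ci * zi for ci, zi in zip(c, z))   # surrogate risk of the rules in Theta
```

The triple `(c, A, b)` is in the inequality form accepted by general-purpose LP solvers; when handing it to one, the entries of `z` corresponding to `theta` and `gamma` should be declared free (unbounded) variables, since many solvers default to non-negative variables. The point `z` built from `Theta` is feasible but not optimal; a solver would search over `theta` as well.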

Complexity: Linear programming is a relatively well-studied problem, with
efficient algorithms previously developed. Specifically, for $K$ leaves, $N$ training
points, and a maximum feature dimension of $D$, we have $O(KD + KN)$ variables
and $O(KN)$ constraints. State-of-the-art primal-dual methods for LP are fast
in practice, with an expected number of iterations $O(\sqrt{n} \log n)$, where $n$ is
the number of variables Anstreicher et al. (1999).
Table 1 Small dataset descriptions

Name            Classes  Stage 1      Stage 2     Stage 3            Stage 4
letter          26       Pixel Count  Moments     Edge Features      -
landsat         6        Band 1       Band 2      Band 3             Band 4
SUN Mech. Turk  16       Function     Materials   Surf./Spat. Prop.  -
Image Seg.      7        Location     Pixel Int.  Color              -


4 Experiments 


We demonstrate the performance of our proposed approach on two types of real-world
data sets. First, we demonstrate the performance of the LP formulation on examples
with fixed structures. In these cases, sensor subsets do not need to be learned and
an exhaustive tree can be constructed over all subsets of sensors. We compare the
performance of our approach to the alternating optimization scheme presented
in Trapeznikov and Saligrama (2013) applied to the same tree, demonstrating the
efficiency and performance of our LP formulation. Next, we apply our proposed
approach to data sets where dimensionality prevents exhaustive search through
feature subsets, and both feature subsets and tree structure must be learned along
with the decision functions in the tree. We show performance on the classification
data sets presented in Kusner et al. (2014) and compare performance with Cost
Sensitive Trees of Classifiers (CSTC) Xu et al. (2013) and Approximately Submodular
Trees of Classifiers (ASTC) Kusner et al. (2014).

For all feature subset classifiers used in our proposed approach and the
alternating optimization approach, classifiers are trained using logistic loss on a
2nd-order homogeneous polynomial expanded basis on the entire training set.

Learning Decision Functions in Fixed Structures: Fig. 2 shows the performance
of our proposed approach on 5 data sets where few sensors are used and a
fixed structure can be easily found. For the letter, landsat, and SUN Mechanical
Turk data sets Frank and Asuncion (2010); Patterson and Hays (2012), a tree can
be easily constructed using all possible feature subsets. For the image segmentation
data set, Frank and Asuncion (2010), we fix a greedily constructed tree with
8 leaves.

For comparison, we use the alternating minimization approach proposed in
Trapeznikov and Saligrama (2013). Additionally, we show the performance of a
simple myopic strategy as a baseline comparison on these examples. The LP
approach generally performs comparably to the non-convex alternating optimization
approach. Additionally, as shown in Table 2, the LP is dramatically faster during
training, with comparable performance to the alternating training approach.^4
Note that we do not compare performance to ASTC or CSTC for these experiments.
The purpose of these examples is to show the efficacy of the LP for training
sequential decision functions independent of feature subsets, classifier design, or
tree structure. Both CSTC and ASTC simultaneously learn the feature subsets,
classifiers, and decision functions, limiting comparison.

Learning Tree Structure: We next applied our proposed approach to three
data sets where learning a tree over all feature subsets is infeasible. For all three
datasets, we learn the set of subsets as described in Section 2 with a total of
16 subsets of features (leaves of the tree) used. For the MiniBooNE and forest
data sets, the proposed approach outperforms CSTC, with performance exceeding
ASTC for the forest data set and matching ASTC for MiniBooNE. On the CIFAR
dataset, the proposed LP approach matches CSTC when using 50 features, but
otherwise is generally outperformed by both CSTC and ASTC. We attribute this
to the limited complexity (2nd-order homogeneous polynomial) of the classification
functions, which requires more features to gain the flexibility to accurately
partition the data.

^4 All computations were performed on an Intel i5 M430 CPU @ 2.27 GHz with 4 cores.

Fig. 2 Comparison of the error vs. average budget trade-off between a myopic
strategy, AM (Trapeznikov and Saligrama (2013)), and our LP method (panels include
(c) Image Seg.; horizontal axis: Avg. Budget). LP clearly outperforms the
myopic approach, and generally matches or exceeds the non-convex approach with the
added benefit of reduced computational cost, repeatability, and guaranteed
convergence.

Table 2 Average percentage of the budget required to achieve a desired error rate
chosen to be close to the error achieved using the entire set of features
(approximately 95% of the improvement gained using all features compared to the
initial features). The percentage of the budget required is with respect to the
maximum budget. The training time is the amount of time (in seconds) required to
learn a policy for a fixed budget trade-off parameter $\alpha$.

Dataset         Target Errors  Myopic  AM   LP   AM Train Time (sec)  LP Train Time (sec)
letter          40%            73%     48%  49%  93.56                57.03
landsat         15%            100%    75%  75%  186.0                108.7
SUN Mech. Turk  40.4%          99%     90%  90%  2818.9               71.08
Image Seg.      9%             56%     21%  26%  46.26                16.46





Fig. 3 Plot of our proposed approach (LP tree) and CSTC on three real world data sets. On 
all three data sets, LP tree generally outperforms CSTC, producing high levels of classification 
accuracy with very low budgets. 


5 Conclusion 

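A Proof of Theorem 3.1

Here $g_{21}$ and $g_{22}$ denote the two second-level decision functions, written
$g_2$ and $g_3$ in Section 3.1. The product of indicators can be expressed as a
minimization over the indicators, allowing the empirical loss to be expressed as:

$$R(g_1, g_{21}, g_{22}, x_i, y_i) = (R_{\max} - \pi_1^i) \min(\mathbb{1}_{g_1(x_i) \le 0}, \mathbb{1}_{g_{21}(x_i) \le 0}) + (R_{\max} - \pi_2^i) \min(\mathbb{1}_{g_1(x_i) \le 0}, \mathbb{1}_{g_{21}(x_i) > 0}) + (R_{\max} - \pi_3^i) \min(\mathbb{1}_{g_1(x_i) > 0}, \mathbb{1}_{g_{22}(x_i) \le 0}) + (R_{\max} - \pi_4^i) \min(\mathbb{1}_{g_1(x_i) > 0}, \mathbb{1}_{g_{22}(x_i) > 0})$$

By swapping the inequalities in the arguments of the indicator functions, the
minimization functions can be converted to maximization functions. Since exactly one
of the four minimizations equals 1, the $R_{\max}$ terms sum to $R_{\max}$:

$$R(g_1, g_{21}, g_{22}, x_i, y_i) = R_{\max} + \pi_1^i \max(\mathbb{1}_{g_1(x_i) > 0}, \mathbb{1}_{g_{21}(x_i) > 0}) - \pi_1^i + \pi_2^i \max(\mathbb{1}_{g_1(x_i) > 0}, \mathbb{1}_{g_{21}(x_i) \le 0}) - \pi_2^i + \pi_3^i \max(\mathbb{1}_{g_1(x_i) \le 0}, \mathbb{1}_{g_{22}(x_i) > 0}) - \pi_3^i + \pi_4^i \max(\mathbb{1}_{g_1(x_i) \le 0}, \mathbb{1}_{g_{22}(x_i) \le 0}) - \pi_4^i$$

Note that due to the dependence of the indicators, there will always be 3 maximization
terms equal to 1 and 1 maximization term equal to zero. As a result, the sum of
maximizations can be expressed as a maximization over the 4 possible combinations,
yielding the expression:

$$R(g_1, g_{21}, g_{22}, x_i, y_i) = R_{\max} - \pi_1^i - \pi_2^i - \pi_3^i - \pi_4^i + \max\Big( (\pi_3^i + \pi_4^i)\, \mathbb{1}_{g_1(x_i) \le 0} + \pi_2^i\, \mathbb{1}_{g_{21}(x_i) \le 0},\; (\pi_3^i + \pi_4^i)\, \mathbb{1}_{g_1(x_i) \le 0} + \pi_1^i\, \mathbb{1}_{g_{21}(x_i) > 0},\; (\pi_1^i + \pi_2^i)\, \mathbb{1}_{g_1(x_i) > 0} + \pi_4^i\, \mathbb{1}_{g_{22}(x_i) \le 0},\; (\pi_1^i + \pi_2^i)\, \mathbb{1}_{g_1(x_i) > 0} + \pi_3^i\, \mathbb{1}_{g_{22}(x_i) > 0} \Big).$$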


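B Proof of Corollary 3.2

The product of indicators over an arbitrary binary tree is given by:

$$R(g, x_i, y_i) = \sum_{k=1}^{K} \underbrace{R_k(f_k, x_i, y_i)}_{\text{risk of leaf } k} \underbrace{\prod_{j=1}^{K-1} [\mathbb{1}_{g_j(x_i) > 0}]^{P_{k,j}} [\mathbb{1}_{g_j(x_i) \le 0}]^{N_{k,j}}}_{G_k(\cdot)\, =\, \text{state of } x_i \text{ in a tree}}$$

Converting the product into a minimization over indicators, the function can be
rewritten:

$$R(g, x_i, y_i) = \sum_{k=1}^{K} (R_{\max} - \pi_k^i) \min_{j \in \{1, \ldots, K-1\}} \Big( [\mathbb{1}_{g_j(x_i) > 0}]^{P_{k,j}} [\mathbb{1}_{g_j(x_i) \le 0}]^{N_{k,j}} \Big)$$

and using the identity $\mathbb{1}_{A} = 1 - \mathbb{1}_{\bar{A}}$, this can be
converted to the maximization:

$$R(g, x_i, y_i) = R_{\max} - \sum_{k=1}^{K} \pi_k^i + \sum_{k=1}^{K} \pi_k^i \max_{j \in \{1, \ldots, K-1\}} \Big( P_{k,j}\, \mathbb{1}_{g_j(x_i) \le 0} + N_{k,j}\, \mathbb{1}_{g_j(x_i) > 0} \Big)$$

As in the 2-region case, the dependence of the indicators always results in $K - 1$
maximization terms equal to 1 and 1 maximization term equal to 0. By examination, the
sum of maximization functions can be expressed as a single maximization over the paths
of the leaves, resulting in the loss shown in (9).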
C Additional Explanation of Prop. 3.3

The linear program of Prop. 3.3 is constructed by replacing the indicators with
hinge losses of the appropriate signs:

$$\min_{g_1, \ldots, g_{K-1},\; \gamma^1, \ldots, \gamma^N,\; \alpha^1, \ldots, \alpha^N,\; \beta^1, \ldots, \beta^N} \; \sum_{i=1}^{N} \gamma^i$$

$$\text{subject to:} \quad \gamma^i \ge w_{p,k}^i \begin{bmatrix} \alpha_1^i \\ \vdots \\ \alpha_{K-1}^i \end{bmatrix} + w_{n,k}^i \begin{bmatrix} \beta_1^i \\ \vdots \\ \beta_{K-1}^i \end{bmatrix}, \quad i \in [N],\; k \in [K],$$

$$1 + g_j(x_i) \le \alpha_j^i, \quad 1 - g_j(x_i) \le \beta_j^i, \quad \alpha_j^i \ge 0, \quad \beta_j^i \ge 0, \quad j \in [K-1],\; i \in [N]. \tag{11}$$

Note that the linear program arises from the fact that any maximization can be
converted to a set of linear constraints with the introduction of a new variable. The
maximization in the objective for each observation is replaced by the introduction of
the variable $\gamma^i$ and the first constraint. The maximization functions in the
hinge losses are replaced by the remaining constraints, introducing the variables
$\alpha_j^i = \max(1 + g_j(x_i), 0)$ and $\beta_j^i = \max(1 - g_j(x_i), 0)$,
respectively.